LSA-based language model adaptation for highly inflected languages

Authors

  • Tanel Alumäe
  • Toomas Kirt
Abstract

This paper presents a language model topic adaptation framework for highly inflected languages. In such languages, subword units are used as the basic units for language modeling. Since such units carry little semantic information, they are not well suited to topic adaptation. We propose to lemmatize the corpus of training documents before constructing a latent topic model. To adapt the language model, we use a few lemmatized training sentences to find a set of documents that are semantically close to the current document. Fast marginal adaptation of a subword trigram language model is used for adapting the background model. Experiments on a set of Estonian test texts show that the proposed approach gives a 19% decrease in language model perplexity. A statistically significant decrease in perplexity is observed already when using just two sentences for adaptation. We also show that the model employing lemmatization gives consistently better results than the unlemmatized model.
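
The two adaptation steps named in the abstract (finding semantically close documents in a latent space, then fast marginal adaptation of the background model) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy lemma-by-document count matrix, the fold-in projection, and the exponent `beta` are all assumptions made for illustration, and the rescaling shown is the commonly cited unigram-ratio form of fast marginal adaptation applied to a single history.

```python
import numpy as np

def lsa_fit(term_doc, k):
    """Truncated SVD of a (lemmas x documents) count matrix."""
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    # Keep the k strongest latent dimensions; documents become rows of V_k.
    return U[:, :k], s[:k], Vt[:k].T

def fold_in(lemma_counts, U_k, s_k):
    """Project a new lemma-count vector into the latent space (q^T U_k S_k^-1)."""
    return (lemma_counts @ U_k) / s_k

def closest_docs(query_vec, doc_vecs, n):
    """Indices of the n documents with the highest cosine similarity to the query."""
    norms = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec)
    sims = (doc_vecs @ query_vec) / (norms + 1e-12)
    return np.argsort(-sims)[:n]

def fast_marginal_adaptation(p_bg_cond, p_bg_uni, p_adapt_uni, beta=0.5):
    """Rescale a background P(w | h) by (P_adapt(w) / P_bg(w)) ** beta and renormalize."""
    alpha = (p_adapt_uni / p_bg_uni) ** beta
    scaled = alpha * p_bg_cond
    return scaled / scaled.sum()

# Toy data (hypothetical): lemmas 0-2 occur in documents 0-1, lemmas 3-5 in documents 2-3.
term_doc = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 3, 1],
    [0, 0, 1, 2],
    [0, 0, 1, 1],
], dtype=float)

U_k, s_k, doc_vecs = lsa_fit(term_doc, k=2)

# A short lemmatized adaptation text that only uses lemmas 0 and 1.
query = fold_in(np.array([1, 1, 0, 0, 0, 0], dtype=float), U_k, s_k)
top = closest_docs(query, doc_vecs, n=2)

# Unigram estimates: background from all documents, adaptation from the retrieved ones.
p_bg_uni = term_doc.sum(axis=1) / term_doc.sum()
adapt_counts = term_doc[:, top].sum(axis=1) + 0.1  # small additive smoothing (assumption)
p_adapt_uni = adapt_counts / adapt_counts.sum()

# Hypothetical background distribution P(w | h) for one fixed history h.
p_bg_cond = np.array([0.2, 0.2, 0.1, 0.2, 0.2, 0.1])
p_adapted = fast_marginal_adaptation(p_bg_cond, p_bg_uni, p_adapt_uni)
```

In this toy setup the retrieved documents are the two that share lemmas with the adaptation text, so the adapted distribution shifts probability mass toward lemmas 0-2. In the setting the abstract describes, retrieval would operate on lemmas while the adapted trigram model operates on subword units, which this sketch does not attempt to reproduce.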

Similar resources

Lemmatized Latent Semantic Model for Language Model Adaptation of Highly Inflected Languages

We present a method to adapt statistical N-gram models for large vocabulary continuous speech recognition of highly inflected languages. The method combines morphological analysis, latent semantic analysis (LSA) and fast marginal adaptation for building topic-adapted trigram models, based on a background language model and very short adaptation texts. We compare words, lemmas and morphemes as b...

Topic detection for language model adaptation of highly-inflected languages by using a fuzzy comparison function

A new framework is proposed to construct corpus-based topic-adapted language models for large vocabulary speech recognition of the highly inflected Slovenian language. The proposed techniques can be applied to other Slavic languages, where words are formed by many different inflectional affixations. In this article an attempt to overcome two important difficulties of highly-inflected languages (hig...

Bilingual-LSA Based LM Adaptation for Spoken Language Translation

We propose a novel approach to crosslingual language model (LM) adaptation based on bilingual Latent Semantic Analysis (bLSA). A bLSA model is introduced which enables latent topic distributions to be efficiently transferred across languages by enforcing a one-to-one topic correspondence during training. Using the proposed bLSA framework crosslingual LM adaptation can be performed by, first, in...

Looking for topic similarities of highly inflected languages for language model adaptation

In this paper, we propose a new framework to construct corpus-based topic-sensitive language models of highly inflected languages for large vocabulary speech recognition. We concentrate on the feature extraction process devoted to languages where words are formed by many different inflectional affixations. In our approach all words with the same meaning but different grammatical form are collected in one...

Rich morpho-syntactic descriptors for factored machine translation with highly inflected languages as target

The baseline phrase-based translation approach has limited success on translating between languages with very different syntax and morphology, especially when the translation direction is from a language with fixed word structure to a highly inflected language. There are two main points to improve on: morphological translation equivalence and long range reordering. Translating the correct surfa...

Publication date: 2007